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University 

Individual-level health data are often not publicly available due to 
confidentiality; masked data are released instead. Therefore, it is im- 
portant to evaluate the utility of using the masked data in statistical 
analyses such as regression. In this paper we propose a data masking 
method which is based on spatial smoothing techniques. The pro- 
posed method allows for selecting both the form and the degree of 
masking, thus resulting in a large degree of flexibility. We investi- 
gate the utility of the masked data sets in terms of the mean square 
error (MSE) of regression parameter estimates when fitting a Gener- 
alized Linear Model (GLM) to the masked data. We also show that 
incorporating prior knowledge on the spatial pattern of the exposure 
into the data masking may reduce the bias and MSE of the parameter 
estimates. By evaluating both utility and disclosure risk as functions 
of the form and the degree of masking, our method produces a risk- 
utility profile which can facilitate the selection of masking parame- 
ters. We apply the method to a study of racial disparities in mortality 
rates using data on more than 4 million Medicare enrollees residing 
in 2095 zip codes in the Northeast region of the United States. 

1. Introduction. Individual-level information such as health data col- 
lected by, for example, government agencies, are often not publicly available 
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in order to preserve confidentiality. On the other hand, there is public de- 
mand on these individual-level data for research purposes. As an example, 
associations of individual health with various risk factors are of great interest 
and concern nowadays. Statistical research that addresses these two compet- 
ing needs is known as statistical disclosure limitation, where a large number 
of methods are developed on how to process and release information that 
is subject to confidentiality concern [Duncan and Lambert (1986); Fienberg 
and Willenborg (1998); Willenborg and Waal (1996, 2001)]. In this paper we 
refer to those methods that alter the original data values as "data masking." 
Corresponding to the two competing needs, a data masking method should 
be evaluated from both the utility of the masked data which represents the 
information retained after the masking, and the disclosure risk of the masked 
data which is the risk that a data intruder can obtain confidential informa- 
tion (e.g., obtain original data values and/or identify an individual to whom 
a data record belongs). Ideally, masked data would have low disclosure risk 
while preserving data utility as much as possible. 

Examples of commonly used data masking methods include aggregated 
tabular counts for categorical data [Fienberg and Slavkovic (2004)], data 
swapping which exchanges values between selected records, with its various 
extensions [Dalenius and Reiss (1982); Fienberg and Mclntyre (2005)], cell 
suppression where certain cells of contingency tables are not displayed [Cox 
(1995)], simulating synthetic data which have the same (conditional) dis- 
tribution as the original data [Rubin (1993); Fienberg, Makov and Steele 
(1998); Raghunathan, Reiter and Rubin (2003); Reiter (2003, 2005b)], and 
additive random noise for continuous variables [Kim (1986); Sullivan and 
Fuller (1989); Fuller (1993); Trottini et al. (2004)], etc. 

Among these methods, data aggregation, data swapping, additive ran- 
dom noise and many other methods can be formulated as matrix mask- 
ing [Duncan and Pearson (1991)]. Suppose data on n observations and p 
variables are stored in a n x p matrix. Matrix masking takes the general 
form of Z* = AZB + C, where Z is the original data matrix and Z* is the 
masked data matrix. Matrices A, B and C are row (observation) operator, 
column (variable) operator and random noise, respectively. Links between 
the above masking methods to matrix masking are investigated in Duncan 
and Pearson (1991), Cox (1994), Fienberg (1994) and Fienberg, Makov and 
Steele (1998). 

Measuring and evaluating utility of masked data is important. In general 
there are two classes of utility measures. One is global utility measures which 
reflect the general distribution of masked data compared to that of the 
original data and are not specific to any analysis. Such measures include 
the number of swaps in data swapping, the added variance in the additive 
random noise approach, differences between continuous original and masked 
data in their first and second moments, etc. More sophisticated measures 
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that compare distributions of masked and original data can be found in 
Dobra et al. (2002), Gomatam, Karr and Sanil (2005) and Woo et al. (2009). 
In addition, Bayesian decision theory-based utility is discussed in Trottini 
and Fienberg (2002) and Dobra, Fienberg and Trottini (2003). 

The second class of utility measures is analysis-specific tailored to an- 
alysts' inference. For the utility associated with regression inference, Karr 
et al. (2006) examine the overlap in the confidence intervals of linear regres- 
sion coefficients estimated with original and masked data. Kim (1986) and 
Fuller (1993) show for the additive random noise approach that if masked 
data preserve the first two moments of original data, then coefficient es- 
timates from linear regression using masked data are (approximately) un- 
biased. In addition, the methods of aggregated tabular counts and data 
swapping can produce valid results for loglinear models because they pre- 
serve the marginal total of contingency tables. This is equivalent to preserv- 
ing sufficient statistics for loglinear models, given that the margins of all 
higher-order interactions that appear in the model are preserved [Fienberg 
and Slavkovic (2004); Fienberg and Mclntyre (2005)]. Recently, Slavkovic 
and Lee (2010) investigated logistic regression inference for contingency ta- 
bles that preserve marginal total or conditional probabilities. However, for a 
general data structure additional research is needed. For example, bias and 
variance of parameter estimates from nonlinear regression using masked data 
are not quantified as functions of masking parameters. 

We propose a special case of matrix masking where we construct row 
(observation) transformed data, that is, Z* = AZ, using spatial smoothing. 
We investigate the mean square error (MSE) of the regression parameter 
estimates when fitting a Generalized Linear Model (GLM) to the masked 
data, and we provide guidance on how to select the masking parameters to 
reduce the MSE. Specifically, for both regressors and outcome we construct 
masked data which are weighted averages of the original individual-level 
data by using linear smoothers. The shape of the smoothing weight func- 
tion defines the "form" of masking and the smoothness parameter measures 
the "degree" of masking. By choosing an appropriate weight function and 
smoothness parameter value, the masked data can account for prior knowl- 
edge on the spatial pattern of individual-level data, and parameter estimates 
from nonlinear regression using such masked data may be less subject to bias 
and MSE. Although data utility is our main focus, we also evaluate iden- 
tification disclosure risk. We consider the scenario wherein a data intruder 
has correct information on the risk factor regressors (e.g., exposure or de- 
mographic data) from some external data sources, and his/her objective is 
to obtain the confidential information on the health outcome through record 
matching. Using our method, we can evaluate both the utility and the disclo- 
sure risk as functions of the form and the degree of masking, which produces 
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a risk-utility profile and can facilitate the selection of the masking parame- 
ters. We also derive a closed-form expression for calculating the first-order 
bias of the regression parameter estimates when estimated using the masked 
data, for any assumed distribution of the outcome given the regressors in 
the exponential family. 

We apply our method to a study of racial disparities in risks of mortality 
for a large sample of the U.S. Medicare population. This study consists 
of more than 4 million individuals in the Northeast region of the United 
States. We develop and apply statistical models to estimate the age and 
gender adjusted association between race and risks of mortality when using 
both the original individual-level data and the masked data. The estimated 
association obtained from using the original individual-level data is the gold- 
standard, and we compare it to the estimated association obtained from 
using the masked data. We also calculate the identification disclosure risk 
of the masked data sets. 

In Section 2 we detail the method, and in Section 3 we present the simula- 
tion studies. In Section 4 we apply our method to the Medicare data set, and 
in Section 5 we discuss the method and the results. The R code is provided 
in the Supplement [Zhou, Dominici and Louis (2010b)], while the Medicare 
data set is not provided due to a confidentiality agreement. Derivation of 
the closed-form expression for the first-order bias of the GLM regression 
parameter estimates when estimated using the masked data is presented in 
the Appendix. 

2. Methods. 

2.1. Matrix masking using spatial smoothing. Assume that the outcome 
variable Y and the regressors X are spatial processes {Y(s),X(s)}, and 
the observed individual- level data {(Yi, Xj), i = 1, . . . , N} are realizations of 
the spatial processes at locations s = {s\, . . . ,sn}, that is, Xj = X(sj), Yj = 
Y(si), i = 1, . . . , N. We construct masked data at s using spatial smoothing, 
and we show later that this masking approach is a special case of matrix 
masking by row (observation) transformation. 

Let W\(u, s;S) denote the relative weight assigned to data at location s 
when generating smoothed data for the target location it, where A > is a 
smoothness parameter, and S denotes all spatial locations in a study area 
so s is a subset of S. The parameter A controls the degree of smoothness, 
with smoothness increasing with A. For notational convenience we suppress 
the dependence of W on S. 

We consider a subclass of linear smoothers under which the smoothed 
spatial processes at location u are defined as follows. For A > 0, 
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(2.1) 

X A (u) = J X(s)W x (u,s)dN(s)/ J W x (u,s)dN(s), 

where N(s) is the counting process for locations with available data from the 
spatial processes {Y(s), X(s)}. For Vu € s we require that Wq(u, s) = Ii s=u \. 
If W is continuous in A, we define Wq(u, s) as lim^o W\(u, s). Therefore, we 
have that {Yo(sj), Xo(sj)} = {Yi,Xi}, the original individual-level data. 

We generate masked data by taking the predictions from (2.1) at s where 
the original individual-level data are available, that is, {Y\(si),'X.x(si),i = 
1, . . . ,N}. By definition in (2.1), the masked data are weighted averages of 
the original individual-level data {Y(si), X(sj)}. The shape of the weight 
function W and the degree of smoothness A control the form and the de- 
gree of masking, respectively, where the degree of masking increases with 
the degree of smoothness. In practice, the masked data at location Sj are 
computed by 

N , N 

Yx{s i ) = Y J Y k Wx{si,s k ) / ^W^s*), 

k=l ' k=l 

(2.2) 

N J N 

Xa(sO = ^2^kWx(si,s k ) / y^Wx(sj,s k ), 

k=l ' k=l 

where the same W and A are applied to both Y and X. Examples of com- 
monly used smoothers within this class include parametric linear regressions 
fitted by ordinary least square and weighted least square, penalized lin- 
ear splines with truncated polynomial basis, kernel smoothers and LOESS 
smoothers [Simonoff (1996); Bowman and Azzalini (1997); Hastie, Tibshi- 
rani and Friedman (2001); Ruppert, Wand and Carroll (2003)]. 

Let y and 3^, denote the vectors of {Yj} and {Yx(si)}, and let X and 
Xx denote the matrices of {Xj} and {X^(si)}, respectively, where Xj and 
Xa(s«),« = 1,...,N, are row vectors. It can be seen that 3^a = ^4a3^ and 
X x = A X X, where A x = (A Xij ) = (Wx(s h ^)/Ef=i W x ( 8i , Sj )). Therefore, 
constructing masked data by equation (2.2) is a special case of matrix mask- 
ing by row (observation) transformation. Reidentification from (yx,Xx) to 
(y, X) requires knowledge of both W and A as well as the existence of AT ■ 



2.2. Bias and variance in nonlinear regression using masked data. Bias 
may arise when a nonlinear model that is specified for the original individual- 
level data is fitted to the masked data. Specifically, we assume the following 
model for the original individual-level data which is viewed as the "truth," 



(2.3) 



g(E{y\X}) = X(3. 



G 



Y. ZHOU, F. DOMINICI AND T. A. LOUIS 




only for a linear function g(x) = ax, where a is a constant (except for few 
special circumstances such as Xj = x, i.e., constant exposure). Specifically, 



It follows that for a nonlinear regression model (2.3), the coefficient estimate 
obtained by fitting model (2.4) will be a biased estimate of (3. Therefore, it is 
important to evaluate the bias of the coefficient estimate under model (2.4) 
as well as how the bias varies as a function of the form and the degree of data 
masking. To consider both the bias and variance of the coefficient estimate 
obtained by fitting model (2.4), we evaluate the MSE as a function of the 
form and the degree of masking. 

It is common to assume that the masked data are mutually indepen- 
dent. However, they are generally correlated, since they combine information 
across the same original data. To investigate the impact of this correlation 
on the uncertainty of the coefficient estimate when using the masked data, 
we compare the "naive" confidence interval under model (2.4) which does 
not account for this correlation with an appropriate confidence interval ob- 
tained by using simulation or bootstrap methods [Efron (1979); Efron and 
Tibshirani (1993)]. 

2.3. Identification disclosure risk of masked data. We evaluate the iden- 
tification disclosure risk of the masked data by calculating the probability 
of identification as developed in Reiter (2005a). To compute the risk of the 
released masked data set, we first compute the probability of matching for 
a particular data record. 

Specifically, let Z = (y, X) denote the unmasked data set and Z x = (y x , X\) 
denote the released masked data set. Let t denote a data vector possessed by 
a data intruder, where t contains the true values for a particular individual. 
Z x can be divided into two components: which consists of variables that 

are not available in t, and zf p which consists of variables that are available 
in t. Zi = (Z U ,Z A P) is the same decomposition of the true data set. Let J 
be a random variable that equals j if to match t with the jth individual in 
Z\. The probability of matching is Pr(J = j\t, Z\),j = 1,. . . ,N, assuming 
that t always corresponds to an individual within Z x . Assumptions about 
the knowledge and behavior of the intruder are used to determine this prob- 
ability. Using Bayes' rule, 



g(E{y x \X x }) = aE{y x \X x } = aE{y x \X} 



aA x E{y\X] 



model (3) 



aA x a~ 1 X/3 = X x /3. 
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where Pr(Z>| J = j, t) can be decomposed into 

Pr(zA,l , • • • , ZA,j-l, ZA ■ • ■ , ZA,AT |ZAJ , J = j, t) 

•Pr(z A / J |z^,J = j,t).Pr(z^|J = j,t). 

Following the guidance in Reiter (2005a), we compute each component of 
Pr(J = _7'|t, Z>) as follows: 

1. Pr(J = j|t) = 1/N. This is because the true values are replaced by some 
weighted averages upon releasing, so exact matching between t and any 
Z^ p record is not possible. 

2. Pr(z^| J = j,t) equals 

(2.5) 1 - 



\<7 AP 



N II Ap 
max fc=l ll z A,/c 



which is the tail probability of a uniform distribution with density 1/ 



JV \\Ap 



v fc=1 \\z X k — 1||. We assume the intruder knows that the masked data 
are weighted averages of the original data. As we point out at the end 
of Section 2.1, detailed information on W and A shall not be released. 
Therefore, it is a reasonable assumption that the intruder will assume 
a uniform distribution based on the difference from t. The larger the 
difference^ the smaller the probability. 

3. Pr(z^|z^, J = j, t) is computed through 

(2.6) J Pr(z A / J |zf , zg, J = j,t) Pr(zf|zg, J = j,t) dzf, 

where Pr(^ |-f , J = J, t) = 1 - m Jf^ z u {{ , Pr(^ \^ >J = M 

is obtained through regression of Zi U on Z^ p , and the integral is computed 
using Monte Carlo integration. 

4. Pr(zA.i, . . . , za.j-i, zaj+i, • • • , Za ; jv| z Aj') ^ = J> *) is conservatively assumed 
to be equal to 1. As pointed out in Reiter (2005a), such assumption pro- 
vides the upper limit on the identification risks and greatly simplifies the 
calculation. 

Assuming a record t is matched to the individuals with the largest proba- 
bility of matching, we measure the identification disclosure risk of the entire 
released data set using the expected percentage of correct matches. Same as 
in Reiter (2005a), we assume that the intruder possesses correct records for 
all individuals in the released data set and seeks to match each record with 
an individual with replacement, that is, matching of one record is indepen- 
dent from matching of another record. Let rrij be the number of individual 
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records with the maximum matching probability for tj,j = 1,...,N. Let 
Ij = 1 if the rrij individual records contain the correct match, and Ij = 
otherwise. The expected percentage of correct matches is Y^f=i 

3. Simulation studies. 

3.1. Data generation, parameter estimation and disclosure risk evalua- 
tion. In this section we conduct simulation studies to illustrate that pa- 
rameter estimates from regression using masked data may be less subject to 
bias and MSE when the selection of the smoothing weight function accounts 
for the spatial patterns of exposure. We illustrate this point using three ex- 
amples. In each case, we define the study area to be [—1, 1] x [—1, 1]. Within 
this study area we randomly select 1000 locations as s where individual-level 
exposure and outcome data are obtained. 

In each example, we define a spatial process of exposure X(s) and we 
obtain X(si) for Si £ s. We simulate the individual-level outcome data at s 
from a model of the general form 

(3.1) Y( Si ) i ~ d - Poisson(e /1+/3X(Sl) ), 

with the individual-level exposure coefficient /3 being the parameter of in- 
terest. The values of \jl and /3 are selected to achieve reasonable variability 
of E{Y(si)\X(si)} under model (3.1) across the locations. 

We construct the masked data {Y\(si), X\(si)} using kernel smoothers, 
and we estimate the exposure coefficient fl\ under model 

(3.2) Y x { Si ) L ~ ' Poisson(e^ +/ ^ x ^), 

which is analogous to model (3.1) but fitted to the masked data. The masked 
data are constructed and (3\ is estimated for each combination of 20 A values 
and two different kernel weights, respectively, so we can evaluate the bias 
and the MSE as functions of both the smoothing weight and A. 

In addition, we construct spatially aggregated data by equally partitioning 
the study area into 7 x 7 = 49 cells and calculating Y + j = Yli=i Y( s i) an d 
X.j = Yli=iX(si)/nj, where nj is the total number of individual-level data 
points in cell j, j = 1, . . . ,49. We estimate the exposure coefficient (3 e using 
the aggregated data {Y + j,X.j} under the analogous model 

(3.3) Y +j L ~ - nj • Poisson(e^ +/3e ^ )- 

To evaluate the identification disclosure risk, we consider the scenario 
that a data intruder possesses the correct exposure data, that is, X(si) for 
Si € s, and seeks the matches with the released data set in order to obtain 
information on the health outcome Y. Specifically, Zi Ap is X and is Y . 

We generate 500 replicates of the individual-level outcome data. For each 
replicate (3\ and f3 e are estimated as above, and the estimates are averaged 
across the 500 replicates. 
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3.2. Choice of smoothing weight function. To select a weight function 
that may lead to less bias and possibly smaller MSE when estimating the 
exposure coefficient using the masked data, we notice that expectation of 
the masked outcome Y\(si) with respect to model (3.2) is 

E{Y x ( Sl )\X x ( Sl )} = e^ x ^>\ 

while expectation of Y\(si) with respect to model (3.1) is 

E{Y x ( Si )\X} = J e^ x ^W x (s u s)dN( s ) 

where X = {X(s)}. The comparison between E{Y\(si)\X.} and 
E{Y\(si)\X\(si)} suggests that we can reduce the bias and possibly the 
MSE of estimating fx and (3 when using the masked data by selecting a W 
s.t. j e^ x ^~ Xx ^ Si ^ W\(si, s) dN(s) is close to 1. One way to construct such 
a W is to assign high weights to locations that receive similar exposure as 
the target location and low weights otherwise. The W constructed in this 
way has the property that it accounts for prior knowledge on the spatial 
pattern of the exposure. In our examples, this is also the spatial pattern of 
the outcome due to the model assumption (3.1). Therefore, to assess the 
difference in bias and MSE when varying the smoothing weight function, we 
construct two different kernel weights for data masking in the way that one 
weight accounts for prior knowledge on the spatial pattern of the exposure 
as above, while the other does not. 

3.3. Example I. We assume that the exposure is eradiated from a point 
source A and decreases symmetrically in all directions as the Euclidean 
distance from A increases. Specifically, we define -Xi(s) = 7exp(— r 2 /2.5) 
for s € [—1,1] x [—1,1], where r s is the Euclidean distance between lo- 
cation s and the point source A. Figure 1(a) shows the contour plot of 

X\(s). The individual-level outcome is simulated from Yi(sj) 
Poisson^ -25 " 1-4 ^ 1 ^ 4 )). Aggregated data of exposure and outcome are con- 
structed by calculating group summaries of {Yi(si), Xi(st)} as described in 
Section 3.1. 

We construct masked data {Yia(si), Xi\(si)} by using equation (2.2) with 
both the Euclidean kernel weight W? and the ring kernel weight W±\ which 
are defined as follows: 

(3.4) Wx(u,s) = exp(-||s-u|| 2 /A), 

(3.5) W lx (u,s) = eM-\r 2 s -rl\/\). 
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The ring kernel weight Wi\(u,s) decreases exponentially as the difference 
between r 2 s and r\ increases, and such difference is positively associated with 
the difference between X\(s) and X\(u) according to the spatial pattern of 
the exposure. Figure 1(b) shows the contour plot of Wi\(si, •). On the other 
hand, the Euclidean kernel weight W£(u,s) solely depends on \\s — u\\, the 
Euclidean distance between location u and location s, and therefore does 
not account for prior knowledge on the spatial distribution of the exposure. 





-1.0 -OS 00 "■■ 10 -1.0 -4£ 00 0.S 9 



T 1 1 1 

flCC 610 OSO 030 0.40 




0.O2 0.10 



Fig. 1. Example I of spatially varying exposure, weight function for spatial smooth- 
ing, estimates, and disclosure risk, (a) Contour plot of exposure from point source 
A: Xi(s) = 7exp(— rJ/2.5), with cells for spatial aggregation, (b) Contour plot of ring 
weight function Wi\(si,s) = exp( — \r'j, — rfj/A) for calculating spatially smoothed ex- 
posure and outcome data at location Si, from individual-level exposure Xi(s) in (a) 
and individual-level outcome Yi(s) simulated by Yi(s) ~ Poisson(exp( — 25 + 4Xl(s))) 
where /3 = 4, with X — 0.5. (c) Estimates of /3\ with "naive" 95% confidence intervals 
by fitting model Yi\(s) ~ Poisson(exp(/iA + /3\Xi\(s))) where {Yi\(s),Xi\(s)} are con- 
structed using the ring weight function in (b) and using the Euclidean weight function 
Wx(si,s) = exp( — ||s — si|| 2 /A), with reference lines at j3 = 4 and at the estimate from 
aggregated data, (d) Mean square error (MSE) of /3\ using "naive" variance, (e) Identi- 
fication disclosure risk measured by the expect percentage of correct record matching, (f) 
Disclosure risk versus MSE for utility-risk trade-off. 
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3.4. Example II. We assume that the exposure is eradiated from a point 
source A and toward a certain direction. Specifically, we define ^(s) = 7 x 
exp(— r 2 /6 — cos# s /3) for s € [—1,1] X [—1,1], where 9 S is the angle between 
the direction from point source A to location s and the direction that the 
exposure is toward, and r s is defined the same as in Example I. Figure 2(a) 
shows the contour plot of ^(s). The individual-level outcome is simulated 

from Y2(si) Poisson(e~ 36+4 ^ 2 '- s ^). Aggregated data of exposure and out- 
come are constructed by calculating group summaries of {Y^Si) , X2(si)} as 
described in Section 3.1. 

We construct masked data {Y 2 \(si), X 2 a(sj)} by using equation (2.2) with 
the Euclidean kernel weight (3.4) and the ring angle kernel weight 

W 2 \(u,s) = exp(-(|r 2 -r 2 | + 2|cos<9 s - cos0 u |)/A), 

which decreases exponentially as the difference between r 2 and r 2 increases 
as well as the difference between cos# s and cos9 u increases. Figure 2(b) 
shows the contour plot of W2x(s±, •). 

3.5. Example III. We assume that the exposure is eradiated from a point 
source A but blocked in a certain area, such as blocked by a mountain, so the 
blocked area receives no exposure. Specifically, we define the unblocked area 
to be s x < 0.4 or cos# s < 0.625 for s € [—1, 1] x [—1, 1], where s x is the x-axis 
value of location s and t? s is the angle between the positive x-axis and the 
direction from point source A to location s. We define the exposure X$(s) = 
7exp(— r 2 /2.5) • I s for s 6 [—1, 1] x [—1, 1], where I s is the indicator that s is 
located within the unblocked area, and r s is defined the same as in Examples 
I and II. Figure 3(a) shows the contour plot of X 3 (s). The individual- level 

outcome is simulated from Y^(si) Poisson(e _24+4 " 5s ' 3 ^^). Aggregated data 
of exposure and outcome are constructed by calculating group summaries 
of {Ys(si), Xs(si)} as described in Section 3.1. 

We construct masked data {Y 3 x(si), X 3 \(si)} by using equation (2.2) with 
the Euclidean kernel weight (3.4) and the ring block kernel weight 

W 3X (u, s) = exp(-|r s 2 - r 2 u \/X) ■ (I s = I u ), 

which assigns nonzero weight only when location u and location s are both 
in the blocked or unblocked area. In addition, the nonzero weight from 
W 3 \(u,s) decreases exponentially as the difference between r 2 and r 2 in- 
creases. Figure 3(b) shows the contour plot of Ws\(si, •). 

3.6. Results. Results of Example I on parameter estimates, MSE and 
identification risk averaged across the 500 simulation replicates are shown in 
Figure l(c)-(e), respectively. Specifically, Figure 1(c) shows the estimated 
(3\ as a function of A for the ring kernel weight (3.5) and the Euclidean 
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Fig. 2. Example II of spatially varying exposure, weight function for spatial 
smoothing, estimates, and disclosure risk, (a) Contour plot of exposure from 
point source A toward a certain direction: ^(s) = 7exp(— — cos# s /3), with 
cells for spatial aggregation, (b) Contour plot of ring angle weight function 
W2\(si,s) = exp( — (\r^ — rfj + 2|cos6 l s — cos# ai |)/A) /or calculating spatially smoothed 
exposure and outcome data at location si, from individual-level exposure X2(s) in (a) and 
individual-level outcome Y2 (s) simulated by Y2 (s) ~ Poisson(exp(— 36 + f3X2(s))) where 
/3 — 4:, with A = 0.5. (c) Estimates of /3\ with "naive" 95% confidence intervals by fit- 
ting model Y2\(s) ~ Poisson(exp(/iA + /3\X2\(s))) where {Y2\(s),X2\{s)} are constructed 
using the ring angle weight function in (b) and using the Euclidean weight function 
Wx(si,s) = exp(— \\s — si 1 1 2 / A) , with reference lines at /3 = 4 and at the estimate from 
aggregated data, (d) Mean square error (MSE) of j3\ using "naive" variance, (e) Identi- 
fication disclosure risk measured by the expect percentage of correct record matching, (f) 
Disclosure risk versus MSE for utility-risk trade-off. 



kernel weight (3.4), with the "naive" 95% confidence intervals. By "naive" 
we mean that the confidence intervals are computed by fitting model (3.2) 
directly, and therefore do not account for the possible correlation between 
the masked data as pointed out earlier in Section 2.2. The reference lines 
are placed at the true value of /3 and at the estimated [3 e , from which the 
bias of estimating the exposure coefficient by using the estimated /3\ can 
be evaluated. Figure 1(d) shows the MSE as a function of A for the two 
kernel weights, where in this example MSE is largely determined by the 
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Fig. 3. Example III o/ spatially varying exposure, weight function for spatial smooth- 
ing, estimates, and disclosure risk, (a) Contour plot of exposure from point source A 
but blocked in certain area: X 3 (s) — 7exp(— r 2 /2.5) • I s where I s is the indicator of loca- 
tion s in the unblocked area, with cells for spatial aggregation, (b) Contour plot of ring 
block weight function W 3 x(si, s) = exp( — |r 2 — r% j/A) ■ (J s = I S1 ) for calculating spatially 
smoothed exposure and outcome data at location si, from individual-level exposure X 3 (s) in 
(a) and individual-level outcome Y 3 (s) simulated by Y 3 (s) ~ Poisson(exp(— 24 + ,3X3(3))) 
where f3 = 4, with A = 0.5. (c) Estimates of /3\ with "naive" 95% confidence intervals 
by fitting model Y 3 x{s) ~ Poisson(exp(/iA + [3\X 3 \(s))) where {Y 3 x(s),X 3 x(s)} are con- 
structed using the ring block weight function in (b) and using the Euclidean weight func- 
tion Wx(si, s) — exp( — 1| s — si || 2 /A), with reference lines at (3 = 4 and at the estimate from 
aggregated data, (d) Mean square error (MSE) of /3\ using "naive" variance, (e) Identi- 
fication disclosure risk measured by the expect percentage of correct record matching, (f) 
Disclosure risk versus MSE for utility-risk trade-off. 



bias. The reference lines are placed at the MSE from regression using the 
original data (in which the bias part is 0) and the MSE of f3 e . Figure 1(e) 
shows the identification disclosure risk of the masked data set measured by 
the expected percentage of correct record matching, as a function of A for 
the two kernel weights. Figure 1(f) plots the disclosure risk versus MSE, 
which shows the trade-off between data utility and disclosure risk. 

We find that data masking using the ring kernel weight (3.5) leads to 
smaller bias and MSE when estimating the exposure coefficient than mask- 
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ing using the Euclidean kernel weight (3.4), for all A values that are con- 
sidered. It suggests that when using the masked data for loglinear regres- 
sion, a masking procedure that preserves the spatial pattern of the original 
individual-level exposure and outcome data can lead to better estimates in 
terms of smaller bias and MSE than a masking procedure that does not 
do so. As A increases, the bias and MSE increase for both kernel weights, 
while the differences in the bias and MSE between the two kernel weights 
decrease. This increase in the bias/MSE and decrease in the bias/MSE dif- 
ferences suggest that in the presence of a high degree of masking, choice 
for the form of masking may be less influential on the resultant bias/MSE. 
Moreover, comparing the estimated (5\ and /3 e , we find that for small values 
of A, the bias and MSE is smaller when using the estimated (3\ from the 
ring kernel weight (3.5). 

On the other hand, we find that the disclosure risk is lower when using the 
Euclidean kernel weight (3.4) for data masking compared to using the ring 
kernel weight (3.5). This is not unexpected because masked data constructed 
using the ring kernel weight is more informative about the original true 
values. However, with a tolerable potential disclosure risk [<0.2 which is used 
as an example cutoff in Reiter (2005a)], masked data when constructed using 
the ring kernel weight can lead to better MSE which cannot be achieved by 
using the Euclidean kernel weight with a comparable A. Same as the trend 
for bias and MSE, the differences in the disclosure risk between the two 
kernel weights become small as A increases. 

Similar results of Example II and Example III are shown in Figure 2(c)-(f ) 
and Figure 3(c)-(f). 

Figure 4 shows the width ratios comparing the 95% "naive" confidence 
intervals versus the percentile confidence intervals obtained from the empiri- 
cal distribution of the estimates across the 500 simulations, for the estimates 
of (3\ in the three examples respectively. Width ratio when A = (the solid 
dot) is calculated using the nonsmoothed data, that is, the individual-level 
data. We find that in these three examples, the "naive" confidence intervals 
generally overestimate the uncertainty of the (3\ estimates, and the degree 
of overestimation increases as A increases. In addition, for Examples II and 
III where the spatial patterns of exposure are nonisotropic, the degree of 
overestimation differs for the weight functions with and without accounting 
for prior knowledge on the spatial pattern of exposure. 

4. Application to Medicare data. We apply our method to the study 
of racial disparities in risks of mortality for a sample of the U.S. Medicare 
population. 

4.1. Data source. We extract a large data set at individual- level from 
the Medicare government database. Specifically, it includes individual age, 
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Fig. 4. Width ratios comparing the 95% "naive" confidence intervals (CI) versus the 
percentile CI obtained from the empirical distributions of the estimates across the 500 
simulations, for the estimates of j3\ in (a) Example I, (b) Example II, and (c) Example 
III of the simulation studies. Width ratio when X — (the solid dot) is calculated using the 
nonsmoothed data. 



race, gender and a day-specific death indicator over the period 1999-2002, 
for more than 4 million black and white Medicare enrollees who are 65 
years and older residing in the Northeast region of the U.S. People who are 
younger than 65 at enrollment are eliminated because they are eligible for 
the Medicare program due to the presence of either a certain disability or 
End Stage Renal Disease and therefore do not represent the general Medicare 
population. 

Figure 5 shows the study area which includes 2095 zip codes in 64 counties 
in the Northeast region of the U.S. We select the counties whose centroids 
are located within the range that covers the Northeast coast region of the 
U.S., and we exclude zip codes without available study population from the 
study map. This area covers several large, urban cities including Washington 
DC, Baltimore, Philadelphia, New York City, New Haven and Boston. It has 
the advantage of high population density and substantial racial diversity. 

We categorize the age of individuals into 5 intervals based on age in his /her 
first year of observation: [65, 70), [70, 75), [75, 80), [80, 85) and [85, +). 
This categorization facilitates detection of age effects because differences 
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Fig. 5. Location of the 2095 zip codes included in our study area. 



in the risks of mortality for one-year increase in age are relatively small. 
We "coarsen" the daily survival information into yearly survival indicators. 
By doing so, we define our outcome as the probability of the occurrence of 
death for an individual in one year. This definition adjusts for the differential 
follow-up time. 

4.2. Statistical models and data masking. Let % denote individual, j de- 
note zip code, t denote year, and D^t be the death indicator for individual 
i in zip code j in year t. Similarly as in Zhou, Dominici and Louis (2010a), 
we define the individual-level model as 

logitPr(Aij = 1) = A) + Piraceij + age ij (3 2 

(4.1) 

+ P^gender^ + (age x gender 

Geographic locations for each individual are needed to spatially smooth 
the individual-level data. However, from the Medicare data we only have 
the longitude and latitude of the zip code centroids. Therefore, we apply 
a two-step masking procedure on the individual-level data, where we first 
aggregate the individual-level data to zip code-level, and we then spatially 
smooth the zip-code level aggregated data to construct the masked data at 
the zip code-level. 

Specifically, let D++j denote the total death count and nj denote the 
total person- years of zip code j. We first obtain from aggregation {% black 
% agecatj, % malej, % (agecat x male)j, pj = D ++ j/nj, j = 1, . . . , J}, which 
are the marginal distributions of race, age, gender, the joint distribution of 
age and gender, and the mortality rate, respectively, of each zip code. 
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Due to the complex spatial pattern of the zip code- level covariates, we 
use kernel smoothers with bivariate normal density kernel weights for spatial 
smoothing, so the shape of the smoothing weight is flexible by varying the 
correlation parameter value of the bivariate normal distribution. Let the 
vector s = {si,s 2 } denote the location of a zip code, where s\ and s 2 are 
the longitude and latitude of the zip code centroid, respectively. We use 
smoothing kernel weights of the general form 

W\(u,s) = exp(-(si - ui,s 2 - u 2 ) T T,^ 1 (s 1 - u 1 ,s 2 - u 2 )/2), 

where 

VpO-102 02 / 

a\ and a\ are the variances of the longitude and latitude data of the 2095 
zip codes, respectively. We consider for p the following three values: 

1. p = 0, so the weight solely depends on the Euclidean distance \\s — u\\; 

2. p = 0.5, so higher weight is assigned to s in the northeast and southwest 
directions of u; 

3. p = —0.5, so higher weight is assigned to s in the northwest and southeast 
directions of u. 

Let pj\ denote the smoothed mortality rate of zip code j from which 
we calculate the smoothed death count D ++ j\ = Pj\ • rij. Let % black j\, 
% agecatj X , % malej\, % (agecat x male)j\ denote the smoothed marginal 
distributions of race, age, gender and the smoothed joint distribution of age 
and gender, respectively, of zip code j. We specify the model for masked 
data as 

£) ++jA~Bin(n i ,p jA ), 
(4.2) logitpjA = Au + blackjx + /3 2A % agecat jX 

+ f^3\% malejx + /3 4A % (agecat x male)jx- 

The zip code-level nonsmoothed aggregated data are also used to fit model 
(4.2). 

To evaluate the identification disclosure risk, we consider the scenario that 
a data intruder possesses correct zip code-level demographic data and seeks 
the matching with the masked zip code-level data set in order to obtain 
information on the zip code-level mortality. Specifically, the released data 
set consists of % blackjx, % agecat jA , % malejx an d Pj\,j = 1, • • - , 2095, and 
the data intruder possess the correct % black j, % agecat j and % malej. 
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4.3. Choice of association measure. The common approach to report the 
association between race and mortality risks is to report the race coefficients 
ft\ in model (4.1) and /3±\ in model (4.2), whose interpretation is subjected to 
the coding of the race covariate. For direct understanding of the difference 
in the risk of death between the black and white populations, we define 
and report the population-level odds ratio (OR) of death comparing Blacks 
versus Whites, which is a function of the predicted values [Zhou, Dominici 
and Louis (2010a)]. Therefore, interpretation of this association measure 
does not depend on model parameterization (e.g., on covariate centering 
and scaling). 

Specifically, let 

Ptijb = Pr(Z) t y = l\raceij = Black, age^, gender^), 
Ptijw = Pr(Aij = l\raceij = White, age tj , gender^) 

denote the predicted probabilities of death in year t for a black person and 
a white person, respectively, whose other covariates values are the same as 
the ith individual in the jth zip code. We define the population-level OR 
from the individual-level model (4.1) as follows: 



OR 

where 



P-bQ-w 

P—wQ---i 



P—h — ^ ^ Ptijbj P- w — ^ ^ Ptijwt 

t,i,j t,i,j 

Q-b — 1 P-bi Q-w — 1 P-w 

Similarly, we define population-level OR\ from model (4.2) using summary 
probabilities 

_ E j n j PjbX _ Y, j n j P jwX 

-r-6A — — and K w \ — — ^ , 

E j ".; Ej "j 

where Pjb\ and Pj w \ are the predicted probabilities of death in one year 
for zip codes that consist of solely black and solely white populations, re- 
spectively, and whose marginal and joint distributions of age and gender 
are the same as zip code j. "Naive" standard errors of log OR\ are cal- 
culated using the multivariate Delta Method [Casella and Berger (2002)]. 
In addition, bootstrap confidence intervals for log OR\ are calculated using 
1000 nonparametric bootstrap samples. Both "naive" and bootstrap confi- 
dence intervals for OR\ are obtained by exponentiating the corresponding 
confidence intervals for log OR\. 
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Fig. 6. Estimates of OR\ under model (4.2) and identification disclosure risk as a func- 
tion of X for the three weight functions. Estimates of OR\ is plotted with the 95% "naive" 
confidence intervals (CI), CI using bootstrap standard error (SE) estimates, and bootstrap 
percentile CI. ORo is estimated by fitting model (4.2) to the nonsmoothed zip code-level ag- 
gregated data, (a) Estimates of OR\ for bivariate normal density kernel weight with p — Q. 
(b) Estimates of OR\ for bivariate normal density kernel weight with p = 0.5. (c) Esti- 
mates of OR\ for bivariate normal density kernel weight with p = —0.5. (d) Identification 
disclosure risk measured by expected percentage of correct matching. 



4.4. Results. Figure 6(a)-(c) shows the estimates of OR\ under model (4.2) 
as a function of A for the three kernel weights respectively, with the 95% 
"naive" confidence intervals, confidence intervals using bootstrap standard 
error estimates and bootstrap percentile confidence intervals. ORq is esti- 
mated by fitting model (4.2) to the nonsmoothed zip code-level aggregated 
data. The reference line is placed at the estimate of OR under the individual- 
level model (4.1). 

For small values of A (<0.1), the estimates of OR\ for all three kernel 
weights are smaller than the estimate of OR and therefore produce negative 
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bias, while for larger values of A the bias differs substantially for different 
kernel weights. For example, data masking using the kernel weight with 
p = 0.5 leads to consistent underestimation of the odds ratio for all A values 
that are considered. When using the kernel weight with p = —0.5 for data 
masking, the estimates of OR\ are less subject to bias than those from using 
the other two kernel weights. Differences in MSE between the three kernels 
can also be inferred from the plots, and we find that using the kernel weight 
with p = —0.5 leads to much smaller MSE than using the other two kernels. 
For all three kernel weights, the "naive" confidence intervals underestimate 
the uncertainty of the OR\ estimates, which is in the opposite direction of 
the relation between the "naive" and the appropriate confidence intervals 
in the simulation studies. The two bootstrap confidence intervals are wider 
than the "naive" confidence interval when A = 0, which suggests a systematic 
difference between the bootstrap confidence intervals and the "naive" con- 
fidence intervals regardless of smoothing. This systematic difference occurs 
because the nonsmoothed zip code-level aggregated data may not satisfy the 
Binomial model assumption in (4.2). 

Figure 6(d) shows the identification disclosure risk of the masked data 
set as measured by the expected percentage of correct matches when using 
the three kernel weights for masking, as a function of A. The disclosure risk 
for all three kernel weights are small, ranging from 0.01-0.04. The risk is 
similar for the masked data sets when using the kernel weight with p = 
and p = —0.5 for masking, and the risk when p = 0.5 is slightly higher. 

5. Discussion. We propose a special case of matrix masking based on 
spatial smoothing techniques, where the smoothing weight function controls 
the form of masking, and the smoothness parameter value directly measures 
the degree of masking. Therefore, data utility and disclosure risk can be 
calculated as functions of both the form and the degree of masking. In fact, 
the smoothing weight function W can be any weight function and is not re- 
stricted by existing smoothing methods. With the variety of combinations of 
weight functions and smoothness parameter values, it is feasible to construct 
masked data that maintain high data utility while preserving confidentiality. 

We consider a subclass of linear smoothers that produces masked data as 
weighted averages of the original data. Therefore, the masked data values 
are within a reasonable range. More importantly, correlation among the 
variables is invariant under linear transformation, which may intrinsically 
contribute to better data utility of the masked data. On the other hand, 
this subclass is a large class. It includes many commonly used smoothers. 
We do not expect major restriction by focusing on this subclass of linear 
smoothers. 

Using our method, we investigate the utility of the masked data in terms 
of bias, variance and MSE of parameter estimates when using the masked 
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data for loglinear and logistic regression analysis. Note that similar studies 
can be applied to any GLM. In addition, we evaluate the identification dis- 
closure risk of the masked data set by calculating the expected percentage of 
correct record matching. In the simulation studies, we provide guidance for 
constructing masked data that can lead to better regression parameter esti- 
mates in terms of smaller bias and MSE for loglinear models, and we show 
the trade-off between better estimates and lower disclosure risk. Specifically, 
masked data can be constructed by using a smoothing weight function that 
accounts for prior knowledge on the spatial pattern of individual-level expo- 
sure, together with a reasonably low degree of masking. We provide guidance 
for how to select such a smoothing weight function for loglinear models. In 
addition, we provide candidate weight functions for three simplified but rep- 
resentative spatial patterns of exposure. 

As is expected, masked data that can lead to better estimates are generally 
more informative about the original data values and therefore are subject 
to relatively higher identification disclosure risk. However, the flexibility in 
our data masking method enables constructing the masked data that can 
lead to good parameter estimates, while the disclosure risk is controlled at 
a low level. In the meanwhile, caution should be placed to the institute 
in releasing detailed information on the data masking approach along with 
masked data. It is pointed out in Section 2.1 that simultaneously releasing 
the smoothing weight function W and the smoothness parameter A in the 
existence of A^ 1 can lead to reidentification of original data. However, even 
if only partial information is released, for example, only the information 
that data are masked using smoothing and the smoothing weight function is 
released while the smoothing parameter value is not released, it is possible 
that a smart data intruder can still reconstruct the transformation matrix 
Ax- 

We apply our data masking method to the study of racial disparities in 
risks of mortality for the Medicare population, and show how the bias and 
the variance of the estimated OR of death comparing blacks to whites, and 
how the identification disclosure risk, vary with the form and the degree 
of masking. The results suggest that in the absence of clear guidance, it is 
helpful to explore a large flexible family such as the bivariate normal density 
kernel to identify a weight function that can lead to both good utility and 
low identification risk for the masked data. 

We compare the "naive" confidence intervals with the appropriate ones 
which account for the possible correlation among masked data in both the 
simulation studies and the data application, where we observe opposite di- 
rections in the relation between the "naive" and the appropriate confidence 
intervals. It suggests no general direction for that relation. One possible rea- 
son, which is also pointed out in Section 4.4, is that the unmasked data in 
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the simulation study are simulated from Poisson distributions, while the un- 
masked data in the data application are real data and do not strictly follow 
the assumed binomial distribution. Therefore, in the data application, the 
standard errors account for both the correlation among the masked data 
and the discrepancy of the original data distribution from binomial. 

The simulation study and data application results show that masked data 
constructed using our method can well preserve confidentiality. Specifically, 
the identification disclosure risk is reasonably low for all scenarios that we 
consider. Note that our calculation of the disclosure risk is conservative: we 
assume that an intruder possesses true values for all the regressors, and we 
use probability 1 for the component Pr(z^ i, . . . , zaj-1; z aj+i ; • ■ • , Z \,N \ z \,j> 
J = j,t) in the calculation. In addition, the flexibility in the selection of 
smoothing weight function W and smoothness parameter A can also help 
control disclosure risk in addition to improving data utility. 

Based on our method, we additionally derive a closed-form expression 
for first-order bias of the parameter estimates obtained using the masked 
data, for GLM that belong to the exponential family. The first-order bias 
calculation is not necessary when both individual-level exposure and health 
outcome data are available so the actual bias can be computed. It may be 
used by researchers who have only the individual-level exposure information 
to explore the possible bias in their analysis using masked data. 

Although our proposed method uses spatial smoothing and therefore ap- 
plies to spatial data, it can be easily generalized to other data types because 
the masking procedure is a smoothing technique that takes weighted aver- 
ages of the original data. For example, the proposed method can be general- 
ized to smoothing time series data by using the smoothing weight function 
W\((i, s), where fi and s denote time points. Also, note that an alternative 
method to mask spatial data is to mask the individual spatial location [see 
Armstrong, Rushton and Zimmerman (1999); Wieland et al. (1998)]. 



We derive a closed-form expression for the first-order bias of estimating 
the regression coefficients in a GLM that belongs to the exponential family, 
when using data masked by our method. Let (3 denote the vector of regres- 
sion coefficients of a model specified for the original individual-level data. 
When the model belongs to the exponential family, its log likelihood can be 
expressed as 



6'(Xj/3) = g 1 (Xj/3), where &'(•) is the derivative of function b(-), and g(-) 
is the link function. Substituting the individual-level data {i^,Xj} by the 
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masked data {Y\(si) ,X^(sj)} , we obtain log likelihood of the analogous 
model when fitted to the masked data, 

(A.1) LL m( /3 A ; A) = £ ^(^Xa(,)/3^(X a (.)/3 a ) + c ( y (ai)M 

where /3 A denotes the corresponding vector of regression coefficients. In order 
to calculate the MLE of /3 A , it is common procedure to calculate the score 
function from the likelihood (A.l) and take its expectation with respect 
to the "true" individual-level model i£{li|Xj}. Denote the expected score 
function as S(A,/3 A ) and denote /3(A) as the solution s.t. 5(A,/3(A)) = 0. It 
can be shown that /3(0) = j3. Taking the derivative of 5(A,/3(A)) = with 
respect to A and evaluating it at A = 0, we obtain the standard result: 

(A.2) /3'(0) = -{S 2 {^m)~ l ■ Si(0,/3(0)), 

where S± and S2 are the partial derivatives with respect to the first and 
second components of dS/dX, respectively. Specifically, 

S\(0,/3(0))=^Xf ( f h(X(s)P)Ro(si,s)dN(s) 

i=l ^ 

(A.3) 

S 2 (0,/3(0)) 

where R ( Si ,s) = ^(s^s)/ Jw x (s t , s )dN( s) ) ^ and h ^ =g -!^ inyerge of 

the link function of the GLM. In practice, Si(0, /3(0)) in (A.3) is calculated 
by substituting the the integrals by summations over all locations where the 
original individual-level data are available. 

The quantity /3'(0) denotes the instant bias of estimating f3 using masked 
data, when changing from no masking to a very low degree of masking. As 
expected, when (i) X(s) is constant across all locations in s, or (ii) g(-) is a 
linear function, 5*1(0, (3(0)) is calculated to be 0, and therefore /3'(0) =0. 

Using /3'(0), we can approximate the bias of estimating (3 when fitting a 
GLM using masked data whose degree of masking is A, by calculating 

/3(A)-/3«/3'(0)-A. 

This bias calculation can be extended to any function of /3, for example, the 
predicted value. Specifically, bias in estimating f((3) can be approximated 
by 

/(/3(A)) - /(/3) « f'((3) ■ (/3(A) - 0) « f\(3) ■ /3'(0) • A. 



-fc 7 (Xi/3) J X(s) T R {s i ,s)dN(s)-(3 

N 

/i'(X/3)-Xfx 4 , 



1=1 
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It can be seen that the first-order bias approximation can be easily gen- 
eralized to approximation using higher-order terms of the Taylor series ex- 
pansion in addition to the first-order term. Specifically, 

/3(A) - /3 « 13' (0) • A + /3"(0) • X 2 /2 + • • • 

(A.4) 

+ /3 (n) (0)-A"/n!, n>l. 

Similarly, we can generalize the bias approximation of estimating f((3). 

A limitation of the bias approximation using Taylor series expansion (A.4) 
is that we ignore the remainder term /3^ n+1 ^(£) • (^+1)! > £ ^ (0> which may 
not be small for large values of A. Therefore, the approximation only captures 
the bias for A ~ 0, that is, the instant direction and magnitude of the bias 
when changing from no masking to a very low degree of masking. It may not 
capture the total bias for a specified degree of masking. In the application 
of our method to the Medicare data, the first-order bias is calculated to 
be for all three kernel weights because Rq in (A. 3) equals 0. In addition, 
when applying the bias approximation (A.4) to the three examples in the 
simulation studies for n = 1,...,5, the bias approximation is calculated to 
be 0, while nonzero bias is shown by comparing the parameter estimates 
when using the masked data with the true parameter value. 

Acknowledgment. Thanks to Dr. Aidan McDermott for the help on the 
Medicare data sources. 

SUPPLEMENTARY MATERIAL 

Supplement: R code (DOI: 10.1214/09-AOAS325SUPP). We provide the 
R code for (1) the simulation study utility part of the three examples, (2) 
the function to compute the disclosure risk, and (3) the calculation of the 
bivariate normal density kernel weight matrix. 
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